Creating your own dataset from Google Images

by: Francisco Ingham and Jeremy Howard. Inspired by Adrian Rosebrock

In this tutorial we will see how to easily create an image dataset through Google Images. Note: You will have to repeat these steps for each new category you want to search for (e.g., once for dogs and once for cats).

Get a list of URLs

Search and scroll

Go to Google Images and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.

Scroll down until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button. Then continue scrolling until you cannot scroll anymore. The maximum number of images Google Images shows is 700.

It is a good idea to add terms you want to exclude to the search query. For instance, if you are searching for the Eurasian wolf, "canis lupus lupus", you might exclude other variants:

"canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis

You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
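
A search string with exclusions, like the Eurasian wolf example above, can also be assembled programmatically, which helps when you repeat the search for several categories. This is a minimal sketch; the helper name build_query is hypothetical and not part of any library:

```python
def build_query(phrase, exclude=()):
    """Quote the exact phrase and prefix each excluded term with '-'."""
    parts = ['"{}"'.format(phrase)]
    parts += ['-{}'.format(term) for term in exclude]
    return ' '.join(parts)

query = build_query('canis lupus lupus',
                    ['dog', 'arctos', 'familiaris', 'baileyi', 'occidentalis'])
print(query)
```

The printed string can be pasted straight into the Google Images search box.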

Download into file

Now you must run some JavaScript code in your browser, which will save the URLs of all the images you want for your dataset.

Press Ctrl+Shift+J on Windows/Linux or Cmd+Opt+J on Mac, and a small window, the JavaScript 'Console', will appear. That is where you will paste the JavaScript commands.

You will need to get the URL of each image. You can do this by running the following commands:

urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
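
Before uploading, it can be worth sanity-checking the saved file. The sketch below assumes the file holds one URL per line, which is what the join on newlines above should produce. The helper clean_urls and the sample text are hypothetical, for illustration only:

```python
def clean_urls(text):
    """Return unique http(s) URLs from text with one URL per line, keeping order."""
    seen, urls = set(), []
    for line in text.splitlines():
        url = line.strip()
        if url.startswith(('http://', 'https://')) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# Made-up sample content, for illustration only.
sample = ("https://example.com/a.jpg\n"
          "\n"
          "https://example.com/a.jpg\n"
          "not a url\n"
          "http://example.com/b.png\n")
print(clean_urls(sample))
```

Blank lines, duplicates and anything that does not look like a web address are dropped; the order of the remaining URLs is preserved.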

Create directory and upload URLs file to your server

In [1]:
from fastai import *
from fastai.vision import *

Choose an appropriate name for your labeled images. You can run these steps multiple times to grab different labels.

In [2]:
folder = 'panda'
file = 'urls_panda.txt'
In [3]:
folder = 'koala'
file = 'urls_koala.txt'
In [4]:
folder = 'kangaroo'
file = 'urls_kangaroo.txt'
In [5]:
path = Path('data/animals')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)
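
Rather than re-running the cells above once per label, the folders could be created in a single loop. This is a minimal sketch using only pathlib; the urls_<label>.txt naming follows the cells above, the helper name make_class_dirs is hypothetical, and a temporary directory stands in for the real data root:

```python
import tempfile
from pathlib import Path

def make_class_dirs(root, labels):
    """Create root/<label> for every label; return {label: (folder, urls filename)}."""
    layout = {}
    for label in labels:
        dest = Path(root) / label
        dest.mkdir(parents=True, exist_ok=True)
        layout[label] = (dest, 'urls_{}.txt'.format(label))
    return layout

# A throwaway root so the example has no side effects; use Path('data/animals') for real.
root = Path(tempfile.mkdtemp()) / 'data' / 'animals'
layout = make_class_dirs(root, ['panda', 'koala', 'kangaroo'])
for label, (dest, urls_file) in layout.items():
    print(label, dest, urls_file)
```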

Finally, upload your URLs file: press 'Upload' in your working directory, select your file, then click 'Upload' on the right.

Download images

In [6]:
folder, file, path, dest
Out[6]:
('kangaroo',
 'urls_kangaroo.txt',
 PosixPath('data/animals'),
 PosixPath('data/animals/kangaroo'))

Now you will need to download your images from their respective URLs.

fast.ai has a function that does just that. You only have to specify the URLs filename and the destination folder, and this function will download and save all images that can be opened. Images that cannot be opened will not be saved.

Let's download our images! Note that you can set a maximum number of images to download with max_pics. In this case we will not download all the URLs.

In [ ]:
download_images(path/folder/file, dest, max_pics=300)

Good! Before looking at our pictures, let's remove any files that can't be opened as images.

In [7]:
classes = ['kangaroo','panda','koala']
In [8]:
# This function will delete the txt files for each animal, leaving me with just images within each animal folder
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_workers=8)
kangaroo
100.00% [274/274 00:06<00:00]
/opt/anaconda3/lib/python3.6/site-packages/PIL/Image.py:953: UserWarning: Palette images with Transparency   expressed in bytes should be converted to RGBA images
  'to RGBA images')
[Errno 21] Is a directory: '/home/jupyter/tutorials/fastai/course-v3/nbs/dl1/data/animals/kangaroo/.ipynb_checkpoints'
panda
100.00% [257/257 00:04<00:00]
[Errno 21] Is a directory: '/home/jupyter/tutorials/fastai/course-v3/nbs/dl1/data/animals/panda/.ipynb_checkpoints'
koala
100.00% [250/250 00:04<00:00]
/opt/anaconda3/lib/python3.6/site-packages/PIL/Image.py:953: UserWarning: Palette images with Transparency   expressed in bytes should be converted to RGBA images
  'to RGBA images')
[Errno 21] Is a directory: '/home/jupyter/tutorials/fastai/course-v3/nbs/dl1/data/animals/koala/.ipynb_checkpoints'
In [ ]:
doc(verify_images)
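
The "[Errno 21] Is a directory" lines in the output above point at the .ipynb_checkpoints folders that Jupyter creates inside each class folder. To check how many files remain in a class folder while skipping such sub-directories, something like the following could be used. It is a minimal sketch with pathlib only; count_files is a hypothetical helper, not a fastai function:

```python
import tempfile
from pathlib import Path

def count_files(folder):
    """Count regular files directly inside folder, ignoring sub-directories."""
    return sum(1 for p in Path(folder).iterdir() if p.is_file())

# Throwaway folder with made-up contents, for illustration only.
demo = Path(tempfile.mkdtemp())
(demo / '.ipynb_checkpoints').mkdir()
(demo / '00000000.jpg').write_bytes(b'')
(demo / '00000001.jpg').write_bytes(b'')
n_files = count_files(demo)
print(n_files)
```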

View data

In [9]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2, ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
In [10]:
data.classes
Out[10]:
['kangaroo', 'koala', 'panda']
In [11]:
data.show_batch(rows=4, figsize=(7,8))
In [12]:
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
Out[12]:
(['kangaroo', 'koala', 'panda'], 3, 602, 176)
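
The sizes above can be checked against valid_pct=0.2. This short sketch only does the arithmetic on the reported numbers; the realized validation fraction is close to, but not exactly, the requested 0.2, and the source does not say why:

```python
n_train, n_valid = 602, 176   # from the output above
total = n_train + n_valid
valid_fraction = n_valid / total
print(total, round(valid_fraction, 3))
```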

Train model

In [13]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)
In [14]:
learn.fit_one_cycle(4)
Total time: 00:50
epoch  train_loss  valid_loss  error_rate
1      0.663003    0.081272    0.039773    (00:14)
2      0.358292    0.085185    0.028409    (00:12)
3      0.248358    0.099006    0.034091    (00:12)
4      0.181493    0.084635    0.028409    (00:11)
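
The error_rate column is easier to interpret as a count of misclassified validation images. A minimal sketch of the conversion, using the 176 validation images reported by len(data.valid_ds) and the error rates from the table above:

```python
n_valid = 176  # validation set size reported earlier
error_rates = [0.039773, 0.028409, 0.034091, 0.028409]  # epochs 1-4 from the table
misclassified = [round(e * n_valid) for e in error_rates]
print(misclassified)
```

With a validation set this small, each single mistake moves the error rate by roughly 0.57 percentage points, so small changes between epochs are not very meaningful.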

In [15]:
learn.save('stage-1')
In [16]:
learn.unfreeze()
In [17]:
learn.lr_find()
LR Finder complete, type {learner_name}.recorder.plot() to see the graph.
In [18]:
learn.recorder.plot()
In [19]:
learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))
Total time: 00:23
epoch  train_loss  valid_loss  error_rate
1      0.057812    0.079823    0.022727    (00:11)
2      0.061857    0.109965    0.034091    (00:11)
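
Passing slice(3e-5,3e-4) as max_lr gives the earliest layers the lower rate and the last layers the higher one. The sketch below shows one way such a range can be spread across layer groups. It assumes three layer groups and geometric (evenly spaced on a log scale) interpolation; both are assumptions rather than something this notebook states, and spread_lrs is a hypothetical helper:

```python
def spread_lrs(start, stop, n_groups):
    """Return n_groups learning rates from start to stop, evenly spaced on a log scale."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]

lrs = spread_lrs(3e-5, 3e-4, 3)
print(['{:.2e}'.format(lr) for lr in lrs])
```

Under these assumptions the middle group trains at about 9.5e-5, the geometric mean of the two endpoints.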

In [20]:
learn.save('stage-2')

Interpretation

In [21]:
learn.load('stage-2')
In [22]:
interp = ClassificationInterpretation.from_learner(learn)
In [23]:
interp.plot_confusion_matrix()
In [27]:
interp.plot_top_losses(6)
In [28]:
interp.most_confused()
Out[28]:
[('panda', 'koala', 3), ('kangaroo', 'koala', 2)]
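
The list above can be read as (actual, predicted, count) triples taken from the off-diagonal cells of the confusion matrix. This is a minimal plain-Python sketch of that idea, not fastai's implementation; confused_pairs is a hypothetical helper, and the diagonal counts in the matrix are made up (only the off-diagonal cells mirror the output above):

```python
def confused_pairs(matrix, classes, min_val=1):
    """List (actual, predicted, count) for off-diagonal cells, largest count first."""
    pairs = [(classes[i], classes[j], matrix[i][j])
             for i in range(len(classes))
             for j in range(len(classes))
             if i != j and matrix[i][j] >= min_val]
    return sorted(pairs, key=lambda t: t[2], reverse=True)

classes = ['kangaroo', 'koala', 'panda']
# Rows are actual classes, columns are predictions.
# Diagonal values are invented for illustration only.
matrix = [[50, 2, 0],
          [0, 60, 0],
          [0, 3, 58]]
pairs = confused_pairs(matrix, classes)
print(pairs)
```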

Cleaning Up

Some of our top losses aren't due to bad performance by our model. There are images in our dataset that shouldn't be there.

Using the FileDeleter widget from fastai.widgets we can prune our top losses, removing photos that don't belong.

First we need to get the file paths for our top losses. interp.top_losses() returns the losses along with the indexes of the corresponding validation images, which we use to look up the paths:

In [ ]:
from fastai.widgets import *

losses,idxs = interp.top_losses()
top_loss_paths = data.valid_ds.x[idxs]
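
Conceptually, this step ranks the validation images by loss and uses the resulting indexes to pick out file paths. A minimal plain-Python sketch of that idea, with made-up losses and file names; rank_losses is a hypothetical helper, not the fastai method:

```python
def rank_losses(losses, k=None):
    """Return (sorted_losses, indexes), largest loss first; k limits how many."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    if k is not None:
        order = order[:k]
    return [losses[i] for i in order], order

# Made-up per-image losses and file names, for illustration only.
losses = [0.02, 1.70, 0.45, 3.10]
paths = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
vals, idxs = rank_losses(losses, k=2)
worst_paths = [paths[i] for i in idxs]
print(worst_paths)
```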

Now we can pass in these paths to our widget.

In [ ]:
fd = FileDeleter(file_paths=top_loss_paths)

Flag photos for deletion by clicking 'Delete'. Then click 'Confirm' to delete the flagged photos and keep the rest in that row. FileDeleter will show you a new row of images until there are no more to show. In this case, the widget will show you images until there are none left from top_losses.

Putting your model in production

In [29]:
data.classes
Out[29]:
['kangaroo', 'koala', 'panda']

You probably want to use a CPU for inference, except at massive scale (and you almost certainly don't need to train in real time). If you don't have a GPU, that happens automatically. You can test your model on CPU like so:

In [30]:
# fastai.defaults.device = torch.device('cpu')
In [31]:
img = open_image(path/'kangaroo'/'00000021.jpg')
img
Out[31]:
In [32]:
classes = ['kangaroo', 'koala', 'panda']
data2 = ImageDataBunch.single_from_classes(path, classes, tfms=get_transforms(), size=224).normalize(imagenet_stats)
learn = create_cnn(data2, models.resnet34)
learn.load('stage-2')
In [33]:
pred_class,pred_idx,outputs = learn.predict(img)
pred_class
Out[33]:
'kangaroo'
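
The three values returned by learn.predict relate to each other roughly as follows: pred_idx is the position of the largest entry in outputs, and pred_class is the matching entry in the class list. A minimal sketch in plain Python, with made-up probabilities standing in for the real outputs:

```python
classes = ['kangaroo', 'koala', 'panda']
# Hypothetical per-class probabilities, NOT values produced by the model above.
outputs = [0.97, 0.02, 0.01]

# Index of the largest probability, then the class name at that index.
pred_idx = max(range(len(outputs)), key=lambda i: outputs[i])
pred_class = classes[pred_idx]
print(pred_class, pred_idx)
```

This is why the class list used at inference time must be in the same order as data.classes during training.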

Things that can go wrong

  • Most of the time things will train fine with the defaults
  • There's not much you really need to tune (despite what you've heard!)
  • The settings most likely to need tuning are
    • Learning rate
    • Number of epochs

Learning rate (LR) too high

In [34]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)
In [35]:
learn.fit_one_cycle(1, max_lr=0.5)
Total time: 00:12
epoch  train_loss  valid_loss  error_rate    
1      15.211160   5814127.500000  0.659091    (00:12)

Learning rate too low

In [36]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)
In [37]:
learn.fit_one_cycle(5, max_lr=1e-5)
Total time: 01:01
epoch  train_loss  valid_loss  error_rate
1      1.577221    1.306330    0.710227    (00:12)
2      1.574153    1.270535    0.710227    (00:10)
3      1.533494    1.233013    0.647727    (00:11)
4      1.513823    1.228780    0.636364    (00:14)
5      1.500860    1.229382    0.642045    (00:12)

In [38]:
learn.recorder.plot_losses()

As well as taking a really long time, the model gets too many looks at each image, so it may overfit.
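
The two learning-rate failure modes above can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2; it is not the notebook's model or optimizer, and the specific rates are chosen only to make the behaviour obvious:

```python
def descend(lr, steps, x=1.0):
    """Run plain gradient descent on f(x) = x**2 (gradient 2*x); return the final loss."""
    for _ in range(steps):
        x = x - lr * 2 * x
    return x ** 2

too_high = descend(lr=1.1, steps=20)    # each step multiplies x by -1.2: loss blows up
too_low = descend(lr=1e-4, steps=20)    # each step multiplies x by 0.9998: loss barely moves
reasonable = descend(lr=0.4, steps=20)  # each step multiplies x by 0.2: loss shrinks quickly
print(too_high, too_low, reasonable)
```

Starting from a loss of 1.0, the high rate ends far above it, the low rate ends just below it, and the moderate rate ends near zero.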

Too few epochs

In [39]:
learn = create_cnn(data, models.resnet34, metrics=error_rate, pretrained=False)
In [40]:
learn.fit_one_cycle(1)
Total time: 00:11
epoch  train_loss  valid_loss  error_rate
1      1.216908    3.202386    0.625000    (00:11)

Too many epochs

In [ ]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.9, bs=32, 
        ds_tfms=get_transforms(do_flip=False, max_rotate=0, max_zoom=1, max_lighting=0, max_warp=0
                              ),size=224, num_workers=4).normalize(imagenet_stats)
In [ ]:
learn = create_cnn(data, models.resnet50, metrics=error_rate, ps=0, wd=0)
learn.unfreeze()
In [ ]:
learn.fit_one_cycle(40, slice(1e-6,1e-4))
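
The cells above deliberately remove regularization (no augmentation, ps=0, wd=0) and train for many epochs on a small training split. The usual symptom of too many epochs is a training loss that keeps falling while the validation loss turns upward. A minimal sketch for spotting that turning point in a loss history; the numbers are made up for illustration and are not results from this notebook, and best_epoch is a hypothetical helper:

```python
def best_epoch(valid_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(valid_losses)), key=lambda i: valid_losses[i]) + 1

# Hypothetical loss curves, NOT output from the notebook.
train_losses = [1.10, 0.60, 0.30, 0.12, 0.05, 0.02]
valid_losses = [1.00, 0.70, 0.55, 0.58, 0.66, 0.80]

epoch = best_epoch(valid_losses)
# Overfitting signature: training loss kept improving after the best epoch,
# while validation loss got worse.
overfitting = (train_losses[-1] < train_losses[epoch - 1]
               and valid_losses[-1] > valid_losses[epoch - 1])
print(epoch, overfitting)
```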